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Abstract 


We introduce a new family of deep neural network models. Instead of specifying a 
discrete sequence of hidden layers, we parameterize the derivative of the hidden 
state using a neural network. The output of the network is computed using a black- 
box differential equation solver. These continuous-depth models have constant 
memory cost, adapt their evaluation strategy to each input, and can explicitly trade 
numerical precision for speed. We demonstrate these properties in continuous-depth 
residual networks and continuous-time latent variable models. We also construct 
continuous normalizing flows, a generative model that can train by maximum 
likelihood, without partitioning or ordering the data dimensions. For training, we 
show how to scalably backpropagate through any ODE solver, without access to its 
internal operations. This allows end-to-end training of ODEs within larger models. 


1 Introduction ; 
Residual Network ODE Network 
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Models such as residual networks, recurrent neural 
network decoders, and normalizing flows build com- 4 
plicated transformations by composing a sequence of 
transformations to a hidden state: 


hi = h; + f (he, 0+) (1) 2 


where t € {0...T} and h; € RP. These iterative , 
updates can be seen as an Euler discretization of a 

continuous transformation (Lu et al., 2017; Haber o 7 7 o o 5 
and Ruthotto, 2017; Ruthotto and Haber, 2018). Input/Hidden/Output Input/Hidden/Output 


What happens as we add more layers and take smaller Figure 1: Left: A Residual network defines a 
steps? In the limit, we parameterize the continuous discrete sequence of finite transformations. 
dynamics of hidden units using an ordinary differen- Right: A ODE network defines a vector 
tial equation (ODE) specified by a neural network: field, which continuously transforms the state. 


dh(t Both: Circles represent evaluation locations. 
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Starting from the input layer h(0), we can define the output layer h(T) to be the solution to this 
ODE initial value problem at some time T. This value can be computed by a black-box differential 
equation solver, which evaluates the hidden unit dynamics f wherever necessary to determine the 
solution with the desired accuracy. Figure 1 contrasts these two approaches. 


Defining and evaluating models using ODE solvers has several benefits: 
Memory efficiency In Section 2, we show how to compute gradients of a scalar-valued loss with 
respect to all inputs of any ODE solver, without backpropagating through the operations of the solver. 


Not storing any intermediate quantities of the forward pass allows us to train our models with constant 
memory cost as a function of depth, a major bottleneck of training deep models. 
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Adaptive computation Euler’s method is perhaps the simplest method for solving ODEs. There 
have since been more than 120 years of development of efficient and accurate ODE solvers (Runge, 
1895; Kutta, 1901; Hairer et al., 1987). Modern ODE solvers provide guarantees about the growth 
of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to 
achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with 
problem complexity. After training, accuracy can be reduced for real-time or low-power applications. 


Scalable and invertible normalizing flows An unexpected side-benefit of continuous transforma- 
tions is that the change of variables formula becomes easier to compute. In Section 4, we derive 
this result and use it to construct a new class of invertible density models that avoids the single-unit 
bottleneck of normalizing flows, and can be trained directly by maximum likelihood. 


Continuous time-series models Unlike recurrent neural networks, which require discretizing 
observation and emission intervals, continuously-defined dynamics can naturally incorporate data 
which arrives at arbitrary times. In Section 5, we construct and demonstrate such a model. 


2 Reverse-mode automatic differentiation of ODE solutions 


The main technical difficulty in training continuous-depth networks is performing reverse-mode 
differentiation (also known as backpropagation) through the ODE solver. Differentiating through 
the operations of the forward pass is straightforward, but incurs a high memory cost and introduces 
additional numerical error. 


We treat the ODE solver as a black box, and compute gradients using the adjoint sensitivity 
method (Pontryagin et al., 1962). This approach computes gradients by solving a second, aug- 
mented ODE backwards in time, and is applicable to all ODE solvers. This approach scales linearly 
with problem size, has low memory cost, and explicitly controls numerical error. 


Consider optimizing a scalar-valued loss function L(), whose input is the result of an ODE solver: 


L(a(t1)) =L (2%) a f F(a(t),t,0)at) = L (ODESolve(z(to), f,to.t1,9)) 6) 


To optimize L, we require gradients with respect 
to 0. The first step is to determining how the 
gradient of the loss depends on the hidden state 
z(t) at each instant. This quantity is called the 
adjoint a(t) = °L/az(t). Its dynamics are given 
by another ODE, which can be thought of as the 
instantaneous analog of the chain rule: 


a a(ti+1) Adjoint State 
a a Nu | CORE ZCN ON 
; ðL a(ty = 
oe Bala dt Oz 
ðz(to) | alti) é 
A“ na. We can compute ĉŁ/əz(to) by another call to an 
| 


t | t t> ODE solver. This solver must run backwards, 
to ti t feta tw starting from the initial value of 24/az(t,). One 


; : a complication is that solving this ODE requires 
Figure 2: Reverse-mode differentiation of an ODE the knowing value of z(t) along its entire tra- 


solution. The adjoint sensitivity method solves jectory. However, we can simply recompute 


an augmented ODE backwards in time. The aug- z(t) backwards in time together with the adjoint, 
mented system contains both the original state and starting from its final value z(t; ). 


the sensitivity of the loss with respect to the state. : i í 

If the loss depends directly on the state at multi- Computing the gradients with respect to the pa- 
ple observation times, the adjoint state must be rameters @ requires evaluating a third integral, 
updated in the direction of the partial derivative of Which depends on both z(t) and a(t): 


the loss with respect to each observation. dL to t).t.0 
ti -f ar êf J , a 
dé fi 00 


State 


(5) 


The vector-Jacobian products a(t)” ot and alt) TH in (4) and (5) can be efficiently evaluated by 


automatic differentiation, at a time cost similar to that of evaluating f. All integrals for solving z, a 


and on can be computed in a single call to an ODE solver, which concatenates the original state, the 
adjoint, and the other partial derivatives into a single vector. Algorithm 1 shows how to construct the 
necessary dynamics, and call an ODE solver to compute all gradients at once. 


Algorithm 1 Reverse-mode derivative of an ODE initial value problem 


Input: dynamics parameters 0, start time to, stop time t4, final state z(t), loss gradient ĉŁ/əz(tı) 


so = [z(t1), BE)? Ojoj] > Define initial augmented state 
def aug_dynamics([z(t), a(t), -], t, 0): > Define dynamics on augmented state 
return [f (z(t), t, 0), —a(t)" of ,—a(t)! of ] > Compute vector-Jacobian products 
[z(to), Bet 2L] = ODESolve (so, aug_dynamics, tı, to, 0) > Solve reverse-time ODE 
return PONE on > Return gradients 


Most ODE solvers have the option to output the state z(t) at multiple times. When the loss depends 
on these intermediate states, the reverse-mode derivative must be broken into a sequence of separate 
solves, one between each consecutive pair of output times (Figure 2). At each observation, the adjoint 
must be adjusted in the direction of the corresponding partial derivative 04/az(t;). 


The results above extend those of Stapor et al. (2018, section 2.4.2). An extended version of 
Algorithm 1 including derivatives w.r.t. tọ and tı can be found in Appendix C. Detailed derivations 
are provided in Appendix B. Appendix D provides Python code which computes all derivatives for 
scipy.integrate.odeint by extending the autograd automatic differentiation package. This 
code also supports all higher-order derivatives. We have since released a PyTorch (Paszke et al., 
2017) implementation, including GPU-based implementations of several standard ODE solvers at 
github. com/rtqichen/torchdiffeq. 


3 Replacing residual networks with ODEs for supervised learning 
In this section, we experimentally investigate the training of neural ODEs for supervised learning. 


Software To solve ODE initial value problems numerically, we use the implicit Adams method 
implemented in LSODE and VODE and interfaced through the scipy. integrate package. Being 
an implicit method, it has better guarantees than explicit methods such as Runge-Kutta but requires 
solving a nonlinear optimization problem at every step. This setup makes direct backpropagation 
through the integrator difficult. We implement the adjoint sensitivity method in Python’s autograd 
framework (Maclaurin et al., 2015). For the experiments in this section, we evaluated the hidden 
state dynamics and their derivatives on the GPU using Tensorflow, which were then called from the 
Fortran ODE solvers, which were called from Python autograd code. 


Model Architectures We experiment with a Table 1: Performance on MNIST. From LeCun 


small residual network which downsamples the €t al. (1998). 
input twice then applies 6 standard residual 


blocks He et al. (2016b), which are replaced Test Error #Params Memory Time 
by an ODESolve module in the ODE-Net vari- 1-Layer MLP? 1.60% 0.24M - - 
ant. We also test a network with the same archi- ResNet 0.41% 060M O(L) O(L) 
tecture but where gradients are backpropagated RK-Net 0.47% 0.22M OL) OW 
ODE-Net 0.42% 022M O(1) O(L) 


directly through a Runge-Kutta integrator, re- 
ferred to as RK-Net. Table 1 shows test error, number of parameters, and memory cost. L denotes 
the number of layers in the ResNet, and L is the number of function evaluations that the ODE solver 
requests in a single forward pass, which can be interpreted as an implicit number of layers. We find 
that ODE-Nets and RK-Nets can achieve around the same performance as the ResNet. 


Error Control in ODE-Nets ODE solvers can approximately ensure that the output is within a 
given tolerance of the true solution. Changing this tolerance changes the behavior of the network. 
We first verify that error can indeed be controlled in Figure 3a. The time spent by the forward call is 
proportional to the number of function evaluations (Figure 3b), so tuning the tolerance gives us a 


trade-off between accuracy and computational cost. One could train with high accuracy, but switch to 
a lower accuracy at test time. 
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Figure 3: Statistics of a trained ODE-Net. (NFE = number of function evaluations.) 


Figure 3c) shows a surprising result: the number of evaluations in the backward pass is roughly 
half of the forward pass. This suggests that the adjoint sensitivity method is not only more memory 
efficient, but also more computationally efficient than directly backpropagating through the integrator, 
because the latter approach will need to backprop through each function evaluation in the forward 
pass. 


Network Depth It’s not clear how to define the ‘depth‘ of an ODE solution. A related quantity is 
the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver 
and dependent on the initial state or input. Figure 3d shows that he number of function evaluations 
increases throughout training, presumably adapting to increasing complexity of the model. 


4 Continuous Normalizing Flows 


The discretized equation (1) also appears in normalizing flows (Rezende and Mohamed, 2015) and 
the NICE framework (Dinh et al., 2014). These methods use the change of variables theorem to 
compute exact changes in probability if samples are transformed through a bijective function f: 

of 


t pote 
de az 


An example is the planar normalizing flow (Rezende and Mohamed, 2015): 


zı = f(Zo) => logp(zı) = log p(zo) — log (6) 


z(t +1) =2(t)+uh(w'a2(t) +b), logp(z(t+ 1)) = log p(a(t)) — log 


ðh 


Generally, the main bottleneck to using the change of variables formula is computing of the deter- 
minant of the Jacobian °f/oz, which has a cubic cost in either the dimension of z, or the number 
of hidden units. Recent work explores the tradeoff between the expressiveness of normalizing flow 
layers and computational cost (Kingma et al., 2016; Tomczak and Welling, 2016; Berg et al., 2018). 


Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the 
computation of the change in normalizing constant: 


Theorem 1 (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable 


with probability p(z(t)) dependent on time. Let @ = f (z(t), t) be a differential equation describing 
a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z 


and continuous in t, then the change in log probability also follows a differential equation, 


O log p(a(t)) df 
a (a5) (8) 


Proof in Appendix A. Instead of the log determinant in (6), we now only require a trace operation. 
Also unlike standard finite flows, the differential equation f does not need to be bijective, since if 
uniqueness is satisfied, then the entire transformation is automatically bijective. 


As an example application of the instantaneous change of variables, we can examine the continuous 
analog of the planar flow, and its change in normalization constant: 
dz(t) alog p(z(t)) + Oh 


— T \ = 
a = uh(w z(t) + b), at =—u Da(t) (9) 


Given an initial distribution p(z(0)), we can sample from p(z(t)) and evaluate its density by solving 
this combined ODE. 


Using multiple hidden units with linear cost While det is not a linear function, the trace function 
is, which implies tr(}>,, Jn) = >, tt(Jn). Thus if our dynamics is given by a sum of functions then 
the differential equation for the log density is also a sum: 


M M 
= hW) ates na) =o (+) (10) 


This means we can cheaply evaluate flow models having many hidden units, with a cost only linear in 
the number of hidden units M. Evaluating such ‘wide’ flow layers using standard normalizing flows 
costs O(M?), meaning that standard NF architectures use many layers of only a single hidden unit. 


Time-dependent dynamics We can specify the parameters of a flow as a function of t, making the 
differential equation f (z(t), t) change with t. This is parameterization is a Pind ot hypernetwork (Ha 
et al., 2016). We also introduce a gating mechanism for each hidden unit, de = >, On(t) fn(z) 
where o,,(t) € (0, 1) is a neural network that learns when the dynamic f,,(z) should be applied. We 
call these models continuous normalizing flows (CNF). 


4.1 Experiments with Continuous Normalizing Flows 


We first compare continuous and discrete planar flows at learning to sample from a known distribution. 
We show that a planar CNF with M hidden units can be at least as expressive as a planar NF with 
K = M layers, and sometimes much more expressive. 


Density matching We configure the CNF as described above, and train for 10,000 iterations 
using Adam (Kingma and Ba, 2014). In contrast, the NF is trained for 500,000 iterations using 
RMSprop (Hinton et al., 2012), as suggested by Rezende and Mohamed (2015). For this task, we 
minimize KL (q(x)||p(x)) as the loss function where q is the flow model and the target density p(-) 
can be evaluated. Figure 4 shows that CNF generally achieves lower loss. 


Maximum Likelihood Training A useful property of continuous-time normalizing flows is that 
we can compute the reverse transformation for about the same cost as the forward pass, which cannot 
be said for normalizing flows. This lets us train the flow on a density estimation task by performing 
maximum likelihood estimation, which maximizes E,x) [log q(x)] where q(-) is computed using 
the appropriate change of variables theorem, then afterwards reverse the CNF to generate random 
samples from q(x). 


For this task, we use 64 hidden units for CNF, and 64 stacked one-hidden-unit layers for NF. Figure 5 
shows the learned dynamics. Instead of showing the initial Gaussian distribution, we display the 
K=8 K=32 M=8 M=32 


K=2 


(a) Target (b) NF (c) CNF (d) Loss vs. K/M 


Figure 4: Comparison of normalizing flows versus continuous normalizing flows. The model capacity 
of normalizing flows is determined by their depth (K), while continuous normalizing flows can also 
increase capacity by increasing width (M), making them easier to train. 
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(a) Two Circles (b) Two Moons 


Figure 5: Visualizing the transformation from noise to data. Continuous-time normalizing flows 
are reversible, so we can train on a density estimation task and still be able to sample from the learned 
density efficiently. 


transformed distribution after a small amount of time which shows the locations of the initial planar 
flows. Interestingly, to fit the Two Circles distribution, the CNF rotates the planar flows so that 
the particles can be evenly spread into circles. While the CNF transformations are smooth and 
interpretable, we find that NF transformations are very unintuitive and this model has difficulty fitting 
the two moons dataset in Figure 5b. 


5 A generative latent function time-series model 


Applying neural networks to irregularly-sampled data such as medical records, network traffic, or 
neural spiking data is difficult. Typically, observations are put into bins of fixed duration, and the 
latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill- 
defined latent variables. Missing data can be addressed using generative time-series models (Alvarez 
and Lawrence, 2011; Futoma et al., 2017; Mei and Eisner, 2017; Soleimani et al., 2017a) or data 
imputation (Che et al., 2018). Another approach concatenates time-stamp information to the input of 
an RNN (Choi et al., 2016; Lipton et al., 2016; Du et al., 2016; Li, 2017). 


We present a continuous-time, generative approach to modeling time series. Our model represents 
each time series by a latent trajectory. Each trajectory is determined from a local initial state, Z+, and 


a global set of latent dynamics shared across all time series. Given observation times to, t,,..., tN 

and an initial state z;,, an ODE solver produces Z;,,..., Zt, Which describe the latent state at each 
observation. We define this generative model formally through a sampling procedure: 

Zto ~ p(Zty) (11) 

Zt, )Zta,+++5Zty = ODESolve(zz,, f 0, to,---, tw) (12) 

each xz, ~ p(x|Z¢,, 0x) (13) 


Function f is a time-invariant function that takes the value z at the current time step and outputs the 
gradient: 92(*)/or = f (z(t), 07). We parametrize this function using a neural net. Because f is time- 
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Figure 6: Computation graph of the latent ODE model. 


invariant, given any latent state z(t), the entire latent trajectory is uniquely defined. Extrapolating 
this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time. 


Training and Prediction We can train this latent-variable model as a variational autoen- 
coder (Kingma and Welling, 2014; Rezende et al., 2014), with sequence-valued observations. Our 
recognition net is an RNN, which consumes the data sequentially backwards in time, and out- 
puts qe (Zo|X1, X2,- --, Xy). A detailed algorithm can be found in Appendix E. Using ODEs as a 
generative model allows us to make predictions for arbitrary time points t;...t,¢ on a continuous 


timeline. 


Poisson Process likelihoods The fact that an observation oc- 
curred often tells us something about the latent state. For ex- 
ample, a patient may be more likely to take a medical test if 
they are sick. The rate of events can be parameterized by a 
function of the latent state: p(event at time t| z(t)) = A(z(t)). 
Given this rate function, the likelihood of a set of indepen- 
dent observation times in the interval [tstat; tena] is given by an 
inhomogeneous Poisson process (Palm, 1943): 


tend 


N 
log p(t tee ty] Estarts tend) = 5 log Alz(ti)) == A(z(t))dt 
i=1 


Estart 


We can parameterize \(-) using another neural network. Con- 
veniently, we can evaluate both the latent trajectory and the 


t 


Figure 7: Fitting a latent ODE dy- 
namics model with a Poisson pro- 
cess likelihood. Dots show event 
times. The line is the learned inten- 
sity A(t) of the Poisson process. 


Poisson process likelihood together in a single call to an ODE solver. Figure 7 shows the event rate 


learned by such a model on a toy dataset. 


A Poisson process likelihood on observation 
times can be combined with a data likelihood to 
jointly model all observations and the times at 
which they were made. 


5.1 Time-series Latent ODE Experiments (a) Recurrent Neural Network 


We investigate the ability of the latent ODE 
model to fit and extrapolate time series. The 
recognition network is an RNN with 25 hidden 
units. We use a 4-dimensional latent space. We 
parameterize the dynamics function f with a 
one-hidden-layer network with 20 hidden units. 


The decoder computing p(x;,|z,,) is another (b) Latent Neural Ordinary Differential Equation 
neural network with one hidden layer with 20 — Ground Truth = 

hidden units. Our baseline was a recurrent neu- e Observation 

ral net with 25 hidden units trained to minimize — Prediction 

negative Gaussian log-likelihood. We trained a = Extrapolation 


second version of this RNN whose inputs were 
concatenated with the time difference to the next 
observation to aid RNN with irregular observa- 
tions. 


(c) Latent Trajectories 


Figure 8: (a): Reconstruction and extrapolation 


Bi-directional spiral dataset We generated 
a dataset of 1000 2-dimensional spirals, each 
starting at a different point, sampled at 100 
equally-spaced timesteps. The dataset contains 
two types of spirals: half are clockwise while 
the other half counter-clockwise. To make the 
task more realistic, we add gaussian noise to the 
observations. 
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of spirals with irregular time points by a recurrent 
neural network. (b): Reconstructions and extrapo- 
lations by a latent neural ODE. Blue curve shows 
model prediction. Red shows extrapolation. (c) A 
projection of inferred 4-dimensional latent ODE 
trajectories onto their first two dimensions. Color 
indicates the direction of the corresponding trajec- 
tory. The model has learned latent dynamics which 
distinguishes the two directions. 


Figure 9: Data-space trajectories decoded from varying one dimension of z;,. Color indicates 
progression through time, starting at purple and ending at red. Note that the trajectories on the left 
are counter-clockwise, while the trajectories on the right are clockwise. 


Time series with irregular time points To generate irregular timestamps, we randomly sample 
points from each trajectory without replacement (n = {30, 50, 100}). We report predictive root- 
mean-squared error (RMSE) on 100 time points extending beyond those that were used for training. 
Table 2 shows that the latent ODE has substantially lower predictive RMSE. 


Figure 8 shows examples of spiral reconstruc- Table 2: Predictive RMSE on test set 


tions with 30 sub-sampled points. Reconstruc- . 
tions from the latent ODE iii obtained by sam- # Observations 30/100 50/100 100/100 
pling from the posterior over latent trajectories RNN 0.3937 0.3202 0.1813 
and decoding it to data-space. Examples with Latent ODE 0.1642 0.1502 0.1346 
varying number of time points are shown in Ap- 

pendix F. We observed that reconstructions and extrapolations are consistent with the ground truth 
regardless of number of observed points and despite the noise. 


Latent space interpolation Figure 8c shows latent trajectories projected onto the first two dimen- 
sions of the latent space. The trajectories form two separate clusters of trajectories, one decoding to 
clockwise spirals, the other to counter-clockwise. Figure 9 shows that the latent trajectories change 
smoothly as a function of the initial point z(to), switching from a clockwise to a counter-clockwise 
spiral. 


6 Scope and Limitations 


Minibatching The use of mini-batches is less straightforward than for standard neural networks. 
One can still batch together evaluations through the ODE solver by concatenating the states of each 
batch element together, creating a combined ODE with dimension D x K. In some cases, controlling 
error on all batch elements together might require evaluating the combined system K times more 
often than if each system was solved individually. However, in practice the number of evaluations did 
not increase substantially when using minibatches. 


Uniqueness When do continuous dynamics have a unique solution? Picard’s existence theo- 
rem (Coddington and Levinson, 1955) states that the solution to an initial value problem exists and is 
unique if the differential equation is uniformly Lipschitz continuous in z and continuous in t. This 
theorem holds for our model if the neural network has finite weights and uses Lipshitz nonlinearities, 
such as tanh or relu. 


Setting tolerances Our framework allows the user to trade off speed for precision, but requires 
the user to choose an error tolerance on both the forward and reverse passes during training. For 
sequence modeling, the default value of 1.5e-8 was used. In the classification and density estimation 
experiments, we were able to reduce the tolerance to 1e-3 and 1e-5, respectively, without degrading 
performance. 


Reconstructing forward trajectories Reconstructing the state trajectory by running the dynamics 
backwards can introduce extra numerical error if the reconstructed trajectory diverges from the 
original. This problem can be addressed by checkpointing: storing intermediate values of z on the 
forward pass, and reconstructing the exact forward trajectory by re-integrating from those points. We 
did not find this to be a practical problem, and we informally checked that reversing many layers of 
continuous normalizing flows with default tolerances recovered the initial states. 


7 Related Work 


The use of the adjoint method for training continuous-time neural networks was previously pro- 
posed (LeCun et al., 1988; Pearlmutter, 1995), though was not demonstrated practically. The 
interpretation of residual networks He et al. (2016a) as approximate ODE solvers spurred research 
into exploiting reversibility and approximate computation in ResNets (Chang et al., 2017; Lu et al., 
2017). We demonstrate these same properties in more generality by directly using an ODE solver. 


Adaptive computation One can adapt computation time by training secondary neural networks 
to choose the number of evaluations of recurrent or residual networks (Graves, 2016; Jernite et al., 
2016; Figurnov et al., 2017; Chang et al., 2018). However, this introduces overhead both at training 
and test time, and extra parameters that need to be fit. In contrast, ODE solvers offer well-studied, 
computationally cheap, and generalizable rules for adapting the amount of computation. 


Constant memory backprop through reversibility Recent work developed reversible versions 
of residual networks (Gomez et al., 2017; Haber and Ruthotto, 2017; Chang et al., 2017), which gives 
the same constant memory advantage as our approach. However, these methods require restricted 
architectures, which partition the hidden units. Our approach does not have these restrictions. 


Learning differential equations Much recent work has proposed learning differential equations 
from data. One can train feed-forward or recurrent neural networks to approximate a differential 
equation (Raissi and Karniadakis, 2018; Raissi et al., 2018a; Long et al., 2017), with applica- 
tions such as fluid simulation (Wiewel et al., 2018). There is also significant work on connecting 
Gaussian Processes (GPs) and ODE solvers (Schober et al., 2014). GPs have been adapted to fit 
differential equations (Raissi et al., 2018b) and can naturally model continuous-time effects and 
interventions (Soleimani et al., 2017b; Schulam and Saria, 2017). Ryder et al. (2018) use stochastic 
variational inference to recover the solution of a given stochastic differential equation. 


Differentiating through ODE solvers The dolfin library (Farrell et al., 2013) implements adjoint 
computation for general ODE and PDE solutions, but only by backpropagating through the individual 
operations of the forward solver. The Stan library (Carpenter et al., 2015) implements gradient 
estimation through ODE solutions using forward sensitivity analysis. However, forward sensitivity 
analysis is quadratic-time in the number of variables, whereas the adjoint sensitivity analysis is 
linear (Carpenter et al., 2015; Zhang and Sandu, 2014). Melicher et al. (2017) used the adjoint 
method to train bespoke latent dynamic models. 


In contrast, by providing a generic vector-Jacobian product, we allow an ODE solver to be trained 
end-to-end with any other differentiable model components. While use of vector-Jacobian products 
for solving the adjoint method has been explored in optimal control (Andersson, 2013; Andersson 
et al., In Press, 2018), we highlight the potential of a general integration of black-box ODE solvers 
into automatic differentiation (Baydin et al., 2018) for deep learning and generative modeling. 


8 Conclusion 


We investigated the use of black-box ODE solvers as a model component, developing new models 
for time-series modeling, supervised learning, and density estimation. These models are evaluated 
adaptively, and allow explicit control of the tradeoff between computation speed and accuracy. 
Finally, we derived an instantaneous version of the change of variables formula, and developed 
continuous-time normalizing flows, which can scale to large layer sizes. 
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Appendix A Proof of the Instantaneous Change of Variables Theorem 


Theorem (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability 
p(z(t)) dependent on time. Let Z = f(z(t),t) be a differential equation describing a continuous-in-time 
transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the 
change in log probability also follows a differential equation: 


O log p(z(t)) df 
a =o @ ©) 


Proof. To prove this theorem, we take the infinitesimal limit of finite changes of log p(z(t)) through time. First 
we denote the transformation of z over an £ change in time as 


z(t +e) = T-(2(t)) (14) 


We assume that f is Lipschitz continuous in z(t) and continuous in t, so every initial value problem has a unique 
solution by Picard’s existence theorem. We also assume z(t) is bounded. These conditions imply that f, Te, and 


ZT. are all bounded. In the following, we use these conditions to exchange limits and products. 
We can write the differential equation Glos pal) 
definition of the derivative: 


Olog p(z(t)) _ iad log p(z(t)) — log |det £T.(2(t))| — log p(z(t)) 


using the discrete change of variables formula, and the 


ot e30t ai om) 
a 
7 on log |det ZT. (2(t))| (16) 
e>ot E 
2 log |det ZT. (z(t 
== jim Be lealde Da c) (by L’Hopital’s rule) (17) 
e>ot ka 
det t 
~~ lim 2! ot az OU) (See. = 1) (18) 
e>0+ |det 2T.(2(t))| ðz |z 
ð o 1 
= li det —T- (z(t lim — —W_ 19 
( im, ge | on ») (2, [det ham) PA 
= Oe 
bounded ST 
lim det 3 T-(z(t)) (20) 
~ e-30+ ÔE Oz * 
The derivative of the determinant can be expressed using Jacobi’s formula, which gives 
Olog p(z(t)) _ . .( 9 00 
r or imeti adj Bg EO) zg EO) (21) 
ð 00 
= —tr (im, adj (Z7r-«w)) ) (im, 3. De aa -(a(0)) (22) 
jee SS 
=I 
00 
7 -r (lim, 5 Oe oa (a(t) (23) 


Substituting Te with its Taylor series expansion and taking the limit, we complete the proof. 


Alog p(z(t)) _ -r ( À 


Ot (24) 


re (z + ef (z(t), ) +0) +06) +...)) 
2. : PARAO + O(e?) + O(e?) +...)) (25) 


(im, 3: 
= — (lim, (Zret t)+O(e +01) +...)) (26) 
( 


2 Fait (27) 
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A.1 Special Cases 


Planar CNF. Let f(z) = uh(w” + b), then of = uL", Since the trace of an outer product is the inner 
product, we have 


ðlogp(z) _ e( ar) age Oh 


ot = Oz Oz (28) 


This is the parameterization we use in all of our experiments. 


Hamiltonian CNF. The continuous analog of NICE (Dinh et al., 2014) is a Hamiltonian flow, which splits 


the data into two equal partitions and is a volume-preserving transformation, implying that 2log ptz) = 0. We 
can verify this. Let 


dz: 
“a | _ | (zat) 
EA = | g(z1-a) | 29) 


Then because the Jacobian is all zeros on its diagonal, the trace is zero. This is a volume-preserving flow. 


A.2 Connection to Fokker-Planck and Liouville PDEs 


The Fokker-Planck equation is a well-known partial differential equation (PDE) that describes the probability 
density function of a stochastic differential equation as it changes with time. We relate the instantaneous change 
of variables to the special case of Fokker-Planck with zero diffusion, the Liouville equation. 


As with the instantaneous change of variables, let z(t) € R? evolve through time following tat) = f(z(t),t). 
Then Liouville equation describes the change in density of z-a fixed point in space—as a PDE, 


dpt) L 


Da gz iz: DP t) (30) 


However, (30) cannot be easily used as it requires the partial derivatives of 7 Best) © which is typically approximated 
using finite difference. This type of PDE has its own literature on efficient and accurate simulation (Stam, 1999). 


Instead of evaluating p(-, t) at a fixed point, if we follow the trajectory of a particle z(t), we obtain 


opz) t) plz oz) Ap(z(t), t) 
at dz(t) ðt at 
—1{Á— m w— 
partial derivative from first argument, z(t) partial derivative from second argument, t 


Ww 


= yes y etet (alt),t) — So flat er RES GD 


> ee Sy, 9 


i=1 


We arrive at the instantaneous change of variables by taking the log, 


log p(z(t),t) 1  Əplz(t), t) _ 3 ACORN) 
Ot p(z(t), t) Ot Oz; 


(32) 


i=1 
While still a PDE, (32) can be combined with z(t) to form an ODE of size D + 1, 

d z(t) _ f(t), t) 

dt heer ee.) = |- yee (33) 


Compared to the Fokker-Planck and Liouville equations, the instantaneous change of variables is of more 
practical impact as it can be numerically solved much more easily, requiring an extra state of D for following 
the trajectory of z(t). Whereas an approach based on finite difference approximation of the Liouville equation 
would require a grid size that is exponential in D. 


Appendix B A Modern Proof of the Adjoint Method 


We present an alternative proof to the adjoint method (Pontryagin et al., 1962) that is short and easy to follow. 
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B.1 Continuous Backpropagation 


Let z(t) follow the differential equation da(t) = f(z(t),t,@), where 0 are the parameters. We will prove that if 
we define an adjoint state 
a(t) = (34) 
dz(t) 
then it follows the differential equation 
aa) a(t) eae 0) (35) 


For ease of notation, we denote vectors as row vectors, whereas the main text uses column vectors. 


The adjoint state is the gradient with respect to the hidden state at a specified time t. In standard neural networks, 
the gradient of a hidden layer h; depends on the gradient from the next layer hi+1 by chain rule 


dL dL dhii 


dh: dh dh: B6) 
With a continuous hidden state, we can write the transformation after an £ change in time as 
t+e 
at+e)= f FU) Odi + a(t) = Talal) t) G7) 
t 
and chain rule can also be applied 
dL _— dL da(t+e) : _ OT: (z(t), t) 
Oz(t) dz(t+e) dz(t) 5 a = AEE Oz(t) (38) 
The proof of (35) follows from the definition of derivative: 
da(t) _ lim a(t +e) — a(t) (39) 
dt es0t E 
alt+ e alt + £) T(z(t 
= lim Ete) — alt + ehas Tela) (by Eq 38) (40) 
e0t E 
a(t+e)—a(t+e aon a(t) + ef (z(t), t,0) + Ole? 
= lim l ) ( uz) (t) eee) ©) (Taylor series around z(t)) 
esot € 
(41) 
a(t+e)-a(t+e) (1 + cfGObo) + ole) 
= lim (42) 
e> 0+ E 
Of (z(t),t,0) 2 
—ea(t+ ec) See + Ole 
= lim C+ oT a (e1 (43) 
e>0t E 
Of (z(t), t, 0) 
= im, a(t+ e) Da(t) + Ole) (44) 
Af (a(t), t,8) 
t : 4 
a(t) ae) (45) 


We pointed out the similarity between adjoint method and backpropagation (eq. 38). Similarly to backpropaga- 
tion, ODE for the adjoint state needs to be solved backwards in time. We specify the constraint on the last time 
point, which is simply the gradient of the loss wrt the last time point, and can obtain the gradients with respect to 
the hidden state at any time, including the initial value. 


dL to da(t) 0 Of (z(t), t, 0) 
tn) = to) =a(t H dt = a(t t) ~*~" (46 
altw) = 375 alto) = alin) + f SEP de= alta) = fay SEO as) 
S o 
initial condition of adjoint diffeq. gradient wrt. initial value 


Here we assumed that loss function L depends only on the last time point ty. If function L depends also on 
intermediate time points t1, t2, . . . , tN —1, etc., we can repeat the adjoint step for each of the intervals [t N—1,t N], 
[t N—2,¢ n-1| in the backward order and sum up the obtained gradients. 


B.2 Gradients wrt. 0 and t 


We can generalize (35) to obtain gradients with respect to 0-a constant wrt. t—and and the initial and end times, 
to and ty. We view 0 and t as states with constant differential equations and write 
06(t) dt(t) 


oa a ml (7) 
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We can then combine these with z to form an augmented state! with corresponding differential equation and 
adjoint state, 


d |Z f([z,0,t]) a dL dL 
dt : (t) = faug ([z, 9, t]) = | > aaug += i , a(t) = dO(t)’ a, (t) = dt(t) (48) 


Note this formulates the augmented ODE as an autonomous (time-invariant) ODE, but the derivations in the 
previous section still hold as this is a special case of a time-variant ODE. The Jacobian of f has the form 


ə of of of 

Oz eX) ð 
Ofeug _ | 9 P y (t) (49) 
Oz, 0, t] 0 0 0 


where each O is a matrix of zeros with the appropriate dimensions. We plug this into (35) to obtain 


Aaug ð aug 
e =—fa(t) a(t) a,(t)] TAA 


The first element is the adjoint differential equation (35), as expected. The second element can be used to obtain 
the total gradient with respect to the parameters, by integrating over the full interval and setting ao (tn) = 0. 


grt- fay EGA a 6D 


(t) =- [af a3% a%] © (50) 


Finally, we also get gradients with respect to to and ty, the start and end of the integration interval. 


= = alty) f(z(tN),tN, 0) = Sato Salin) L e dt 69 


Between (35), (46), (51), and (52) we have gradients for all possible inputs to an initial value problem solver. 
Appendix C Full Adjoint sensitivities algorithm 
This more detailed version of Algorithm 1 includes gradients with respect to the start and end times of integration. 


Algorithm 2 Complete reverse-mode derivative of an ODE initial value problem 


Input: dynamics parameters 0, start time to, stop time t4, final state z(t), loss gradient ĉŁ/əz(tı) 


ge = wees felti), t1, 0) > Compute gradient w.r.t. tı 
so = [z(t1), O Oj), =F > Define initial augmented state 
def aug_dynamics([z(t), a(t), -,-], t, 0): > Define dynamics on augmented state 
return [/(z(t),t,0), —a(t)! of ,—a(t)! of ,—a(t)! of] > Compute vector-Jacobian products 
[z(to), PONE or $e] = ODESolve(so, aug_dynamics, tı, to, 0) > Solve reverse-time ODE 
return ety on oe : ot > Return all gradients 


‘Note that we’ve overloaded t to be both a part of the state and the (dummy) independent variable. The 
distinction is clear given context, so we keep t as the independent variable for consistency with the rest of the 
text. 
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Appendix D Autograd Implementation 
import scipy.integrate 


import autograd.numpy as np 

from autograd.extend import primitive, defvjp_argnums 
from autograd import make_vjp 

from autograd.misc import flatten 

from autograd. builtins import tuple 


odeint = primitive(scipy.integrate.odeint) 


def grad_odeint_all(yt, func, yO, t, func_args, x**xkwargs): 
# Extended from "Scalable Inference of Ordinary Differential 
# Equation Models of Biochemical Processes", Sec. 2.4.2 
# Fabian Froehlich, Carolin Loos, Jan Hasenauer, 2017 
# https ://arxiv.org/pdf/1711.08079. pdf 


T, D = np.shape(yt) 
flat_args , unflatten = flatten(func_args) 


def flat_func(y, t, flat_args): 
return func(y, t, *xunflatten(flat_args )) 


def unpack(x): 


# y, vjp_y. vjp_t, vjp_args 
return x[0:D], x[D:2 * D], x[2 * D], x[2 x D+ 1:] 


def augmented_dynamics(augmented_state, t, flat_args): 
# Orginal system augmented with vjp_y, vjp_t and vjp_args. 
y, VjJp_y, _, _ = unpack(augmented_state) 
vjp_all, dy_dt = make_vjp(flat_func , argnum=(0, 1, 2))(y, t, 
vjp_y. vjp_t, vjp_args = vjp_all(—vjp_y) 
return np.hstack((dy_dt, vjp_y, vjp_t, vjp_args)) 


def vjp_all(g,**kwargs ): 


vjp_y = g[-l, :] 


vjp_tO = 0 
time_vjp_list = [] 
vjp_args = np. zeros(np.size(flat_args )) 


for i in range(T — 1, 0, —1): 


# Compute effect of moving current time. 

vjp_cur_t = np.dot(func(yt[i, :], t[i], *func_args), g[i, 
time_vjp_list.append(vjp_cur_t) 

vjp_tO = vjp_tO — vjp_cur_t 


# Run augmented system backwards to the previous observation. 


aug_y0 = np.hstack((yt[i, :], vjp_y, vjp_t0O, vjp_args)) 
aug_ans = odeint(augmented_dynamics, aug_y0, 


np.array([t[i], t[i — 1]]), tuple((flat_args ,)), 


_, Vvjp_y, vjp_t0O, vjp_args = unpack(aug_ans[1]) 


# Add gradient from current output. 
vjp-y = vjp-y + Bla = 1, :] 


time_vjp_list.append(vjp_t0) 
vjp_times = np.hstack(time_vjp_list)[::—1] 


return None, vjp_y, vjp_times, unflatten(vjp_args) 
return vjp_all 
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flat_args) 


** kwargs ) 


def grad_argnums_wrapper(all_vjp_builder ): 
# A generic autograd helper function. Takes a function that 
# builds vjps for all arguments, and wraps it to return only required vjps. 
def build_selected_vjps(argnums, ans, combined_args, kwargs): 
vjp_func = all_vjp_builder(ans, *combined_args, **xkwargs) 
def chosen_vjps(g): 
# Return whichever vjps were asked for. 
all_vjps = vjp_func(g) 
return [all_vjps[argnum] for argnum in argnums ] 
return chosen_vjps 
return build_selected_vjps 


defvjp_argnums(odeint , grad_argnums_wrapper(grad_odeint_all)) 


Appendix E Algorithm for training the latent ODE model 


To obtain the latent representation z;,, we traverse the sequence using RNN and obtain parameters of distribution 
q(Zto|{Xt;, tibi, Benc). The algorithm follows a standard VAE algorithm with an RNN variational posterior and 
an ODESolve model: 


1. Run an RNN encoder through the time series and infer the parameters for a posterior over Zi): 
q(Zto |{X2;,ti}s, Q) = N (2to|Maeg > Fz0)> (53) 
where Hzo, Czo comes from hidden state of RNN({ xz, , ti}i, 0) 


2. Sample Zt, ~ q(Ztol{Xt;, ti}s) 


3. Obtain Z: ,Zt.,-.-,Ze,, by solving ODE ODESolve(zi,, f, 97, to,..-,tar), where f is the function 
defining the gradient dz/dt as a function of z 


4, Maximize ELBO = 977", log p(X; |Z; Ox) + log P(Zto) — log q(ato|{xe;, tihi; O), 
where p(Zt,) = M (0, 1) 


Appendix F Extra Figures 


= Ground Truth 
e Observation 

= Prediction 

= Extrapolation 


(a) 30 time points (b) 50 time points (c) 100 time points 


Figure 10: Spiral reconstructions using a latent ODE with a variable number of noisy observations. 
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